Analyzing Word Frequencies in Large Text Corpora Using Inter-arrival Times and Bootstrapping

Authors

Jefrey Lijffijt

Panagiotis Papapetrou

Kai Puolamäki

Heikki Mannila

Abstract

Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given a text corpus, we want to test a hypothesis, such as “word X is frequent”, “word X has become more frequent over time”, or “word X is more frequent in male than in female speech”. For this purpose we need a null model of word frequencies. The commonly used bag-of-words model, which corresponds to a Bernoulli process with fixed parameter, does not account for any structure present in natural languages. Using this model for word frequencies results in large numbers of words being reported as unexpectedly frequent. We address how to take into account the inherent occurrence patterns of words in significance testing of word frequencies. Based on studies of words in two large corpora, we propose two methods for modeling word frequencies that both take into account the occurrence patterns of words and go beyond the bag-of-words assumption. The first method models word frequencies based on the spatial distribution of individual words in the language. The second method is based on bootstrapping and takes into account only word frequency at the text level. The proposed methods are compared to the current gold standard in a series of experiments on both corpora. We find that words obey different spatial patterns in the language, ranging from bursty to non-bursty/uniform, independent of their frequency, showing that the traditional approach leads to many false positives.
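The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`inter_arrival_times`, `relative_frequency`, `bootstrap_p_value`), the one-sided form of the test, and the smoothed p-value formula are assumptions chosen for illustration. The sketch measures the gaps between successive occurrences of a word, and it tests whether a word is more frequent in corpus S than in corpus T by resampling whole texts with replacement, so that only text-level frequencies enter the test.

```python
import random
from typing import List, Sequence


def inter_arrival_times(tokens: Sequence[str], word: str) -> List[int]:
    """Gaps (in tokens) between consecutive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def relative_frequency(texts: Sequence[Sequence[str]], word: str) -> float:
    """Occurrences of `word` divided by the total token count of `texts`."""
    total = sum(len(text) for text in texts)
    hits = sum(text.count(word) for text in texts)
    return hits / total if total else 0.0


def bootstrap_p_value(
    corpus_s: Sequence[Sequence[str]],
    corpus_t: Sequence[Sequence[str]],
    word: str,
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """One-sided bootstrap test of "word is more frequent in S than in T".

    Assumption for illustration: each corpus is a list of tokenized texts,
    and whole texts are resampled with replacement.
    """
    rng = random.Random(seed)
    not_greater = 0
    for _ in range(n_resamples):
        sample_s = rng.choices(corpus_s, k=len(corpus_s))
        sample_t = rng.choices(corpus_t, k=len(corpus_t))
        freq_s = relative_frequency(sample_s, word)
        freq_t = relative_frequency(sample_t, word)
        if freq_s <= freq_t:
            not_greater += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (1 + not_greater) / (1 + n_resamples)
```

A word whose inter-arrival times are mostly short with a few very long gaps is bursty, while roughly even gaps indicate a more uniform spread. Because the bootstrap resamples texts rather than individual tokens, a word concentrated in a handful of texts yields a more variable frequency estimate, and hence a more conservative p-value, than a bag-of-words model would give.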


Similar resources

Bootstrapping Large Sense Tagged Corpora

The performance of Word Sense Disambiguation systems largely depends on the availability of sense tagged corpora. Since the semantic annotations are usually done by humans, the size of such corpora is limited to a handful of tagged texts. This paper proposes a generation algorithm that may be used to automatically create large sense tagged corpora. The approach is evaluated through comparative ...


Modeling Tweet Arrival Times using Log-Gaussian Cox Processes

Research on modeling time series text corpora has typically focused on predicting what text will come next, but less well studied is predicting when the next text event will occur. In this paper we address the latter case, framed as modeling continuous inter-arrival times under a log-Gaussian Cox process, a form of inhomogeneous Poisson process which captures the varying rate at which the tweets...


Extracting Constraints on Word Usage from Large Text Corpora

Our research focuses on the identification of word usage constraints from large text corpora. Such constraints are important for natural language systems, both for the problem of selecting vocabulary for language generation and for disambiguating lexical meaning in interpretation. The first stage of our research involves the development of systems that can automatically extract such constraints...


Applying the Espresso-algorithm to large parsed corpora

Information extraction systems learn patterns for extracting pairs instantiating a given relation from text. For instance, for the relation capital-of, a system might learn extraction patterns such as ‘Arg1 is capital of Arg2’ or ‘The ambassador of Arg2 was called back to Arg1’. Lightly supervised information extraction systems learn extraction patterns by means of a bootstrapping procedure, wh...


Sary: Reusable Components and Tools for Searching Large Corpora

Since corpus-based natural language processing has to deal with large corpora, efficient searching of the large corpora is inevitably necessary. For example, one might want to examine how a word or a phrase is used in the large corpora or to collect frequencies of all terms in the large corpora. Our system Sary solves these problems by providing fast full-text search facilities for a single lar...




Publication date: 2011
